Gaussian Processes (GP) — Regression & Probabilistic Classification#
Gaussian Processes (GPs) are a Bayesian, nonparametric way to do supervised learning.
GP regression: returns a distribution over functions that fit the data.
GP classification: returns probabilities (via a latent GP + a link function).
A GP is best thought of as:
A probability distribution over plausible functions.
Why people like GPs#
Advantages:
Interpolates the observations (for many kernels when the noise is very small).
Probabilistic predictions: you get uncertainty estimates (e.g., 95% intervals).
Versatile kernels: you can encode assumptions (smoothness, periodicity, roughness).
Disadvantages:
Not sparse by default: it uses all data at prediction time.
Training can be expensive: typically \(\mathcal{O}(n^3)\) time and \(\mathcal{O}(n^2)\) memory.
Can struggle in very high-dimensional feature spaces without careful kernel design / scaling.
This notebook focuses on intuition + statistical background (Gaussian conditioning), uses Plotly visuals heavily, implements GP regression from scratch, and then connects the results to scikit-learn for both regression and classification.
Learning goals#
By the end you should be able to:
explain what a GP is (distribution over functions) and what a kernel means
sample functions from a GP prior and interpret how kernels change function shapes
derive GP regression by conditioning a multivariate Gaussian
implement GP regression in NumPy (Cholesky-based)
interpret the posterior mean and posterior variance
understand log marginal likelihood and why it balances fit vs complexity
use GaussianProcessRegressor and GaussianProcessClassifier in scikit-learn and explain key parameters
Notation#
Inputs: \(X = [x_1,\dots,x_n]^T\) (shape \(n\times d\))
Latent function values: \(f = [f(x_1),\dots,f(x_n)]^T\)
Observations (regression): \(y = f + \varepsilon\), with \(\varepsilon \sim \mathcal{N}(0,\sigma^2 I)\)
Kernel (covariance function): \(k(x,z)\)
Kernel matrix: \(K(X,X)\) where \(K_{ij} = k(x_i, x_j)\)
Table of contents#
The GP idea: a distribution over functions
Kernels as similarity + shape priors (Plotly demos)
GP regression via Gaussian conditioning (math + intuition)
GP regression from scratch (NumPy)
Hyperparameters + log marginal likelihood (Occam’s razor)
scikit-learn GP regression + parameter map
GP classification: latent GP + logistic link + Laplace (intuition)
scikit-learn GP classification (probability maps)
Practical tips, limitations, and scaling strategies
Exercises + references
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from dataclasses import dataclass
from scipy.special import expit # sigmoid
from sklearn.gaussian_process import GaussianProcessRegressor, GaussianProcessClassifier
from sklearn.gaussian_process.kernels import (
RBF,
Matern,
ExpSineSquared,
RationalQuadratic,
WhiteKernel,
ConstantKernel,
)
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) The GP idea: a distribution over functions#
A Gaussian Process is defined as:
\[
f(x) \sim \mathcal{GP}\big(m(x),\, k(x, z)\big)
\]
meaning: for any finite set of inputs \(X=[x_1,\dots,x_n]\), the vector of function values \(f = [f(x_1),\dots,f(x_n)]^T\)
is multivariate Gaussian:
\[
f \sim \mathcal{N}\big(m(X),\, K(X,X)\big)
\]
Intuition:
The mean function \(m(x)\) is often set to 0 (a neutral prior).
The kernel \(k(x,z)\) says how much you expect two points to co-vary.
Sampling from a GP prior gives you random functions consistent with those assumptions.
def _as_2d(X: np.ndarray) -> np.ndarray:
X = np.asarray(X, dtype=float)
if X.ndim == 1:
return X.reshape(-1, 1)
return X
def rbf_kernel(
X: np.ndarray,
Z: np.ndarray,
length_scale: float = 1.0,
variance: float = 1.0,
) -> np.ndarray:
X = _as_2d(X)
Z = _as_2d(Z)
length_scale = float(length_scale)
variance = float(variance)
X_norm = np.sum(X**2, axis=1)[:, None]
Z_norm = np.sum(Z**2, axis=1)[None, :]
    dist2 = np.maximum(X_norm + Z_norm - 2.0 * (X @ Z.T), 0.0)  # clamp tiny negatives from roundoff
return variance * np.exp(-0.5 * dist2 / (length_scale**2))
def periodic_kernel(
X: np.ndarray,
Z: np.ndarray,
length_scale: float = 1.0,
period: float = 1.0,
variance: float = 1.0,
) -> np.ndarray:
X = _as_2d(X)
Z = _as_2d(Z)
length_scale = float(length_scale)
period = float(period)
variance = float(variance)
# For 1D this is just |x-z|; for d>1 you could generalize with norms.
d = np.abs(X - Z.T)
return variance * np.exp(-2.0 * (np.sin(np.pi * d / period) ** 2) / (length_scale**2))
def sample_gp_prior_1d(
x_grid: np.ndarray,
kernel_fn,
n_samples: int = 5,
jitter: float = 1e-9,
) -> np.ndarray:
X = _as_2d(x_grid)
K = kernel_fn(X, X) + jitter * np.eye(X.shape[0])
return rng.multivariate_normal(mean=np.zeros(X.shape[0]), cov=K, size=n_samples)
def plot_gp_samples_1d(
x: np.ndarray,
samples: np.ndarray,
title: str,
y_mean: np.ndarray | None = None,
y_std: np.ndarray | None = None,
y_train: np.ndarray | None = None,
x_train: np.ndarray | None = None,
) -> go.Figure:
fig = go.Figure()
if y_mean is not None and y_std is not None:
y_mean = np.asarray(y_mean)
y_std = np.asarray(y_std)
upper = y_mean + 1.96 * y_std
lower = y_mean - 1.96 * y_std
fig.add_trace(go.Scatter(x=x, y=upper, mode="lines", line=dict(width=0), showlegend=False))
fig.add_trace(
go.Scatter(
x=x,
y=lower,
mode="lines",
line=dict(width=0),
fill="tonexty",
fillcolor="rgba(31, 119, 180, 0.18)",
name="95% interval",
)
)
fig.add_trace(go.Scatter(x=x, y=y_mean, mode="lines", name="mean", line=dict(width=3)))
for i, s in enumerate(samples):
fig.add_trace(
go.Scatter(
x=x,
y=s,
mode="lines",
line=dict(width=1),
name=f"sample {i+1}",
opacity=0.7,
showlegend=i == 0,
)
)
if x_train is not None and y_train is not None:
fig.add_trace(
go.Scatter(
x=np.asarray(x_train),
y=np.asarray(y_train),
mode="markers",
marker=dict(size=8, color="black"),
name="observations",
)
)
fig.update_layout(title=title, xaxis_title="x", yaxis_title="f(x)")
return fig
x = np.linspace(-5, 5, 220)
kernels = {
"RBF (l=0.4)": lambda X, Z: rbf_kernel(X, Z, length_scale=0.4, variance=1.0),
"RBF (l=1.5)": lambda X, Z: rbf_kernel(X, Z, length_scale=1.5, variance=1.0),
"Periodic (p=2.5)": lambda X, Z: periodic_kernel(X, Z, length_scale=0.6, period=2.5, variance=1.0),
}
fig = go.Figure()
for name, kfn in kernels.items():
k_slice = kfn(x.reshape(-1, 1), np.array([[0.0]])).ravel()
fig.add_trace(go.Scatter(x=x, y=k_slice, mode="lines", name=name))
fig.update_layout(
title="Kernel slice k(x, 0): how correlation decays with distance",
xaxis_title="x",
yaxis_title="k(x, 0)",
)
fig.show()
x_grid = np.linspace(-4, 4, 160)
samples_rbf_short = sample_gp_prior_1d(x_grid, kernels["RBF (l=0.4)"], n_samples=6)
samples_rbf_long = sample_gp_prior_1d(x_grid, kernels["RBF (l=1.5)"], n_samples=6)
samples_per = sample_gp_prior_1d(x_grid, kernels["Periodic (p=2.5)"], n_samples=6)
from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=3, subplot_titles=["RBF: short length-scale", "RBF: long length-scale", "Periodic kernel"])
for s in samples_rbf_short:
fig.add_trace(go.Scatter(x=x_grid, y=s, mode="lines", line=dict(width=1), showlegend=False), row=1, col=1)
for s in samples_rbf_long:
fig.add_trace(go.Scatter(x=x_grid, y=s, mode="lines", line=dict(width=1), showlegend=False), row=1, col=2)
for s in samples_per:
fig.add_trace(go.Scatter(x=x_grid, y=s, mode="lines", line=dict(width=1), showlegend=False), row=1, col=3)
fig.update_layout(title="Sampling random functions from GP priors (kernels = shape assumptions)")
fig.show()
2) Kernels as similarity + shape priors#
Kernels do two jobs at once:
Similarity: if \(k(x,z)\) is large, the model believes \(f(x)\) and \(f(z)\) should be similar.
Shape prior: the kernel determines what kind of functions are likely (smooth, wiggly, periodic, etc.).
One useful mental picture is the kernel matrix \(K(X,X)\):
If it has large off-diagonal values, points are strongly coupled → smoother functions.
If it is close to diagonal, points are weakly coupled → rougher / more local behavior.
Xk = x_grid.reshape(-1, 1)
K_short = kernels["RBF (l=0.4)"](Xk, Xk)
K_long = kernels["RBF (l=1.5)"](Xk, Xk)
fig = make_subplots(rows=1, cols=2, subplot_titles=["RBF kernel matrix (short l)", "RBF kernel matrix (long l)"])
fig.add_trace(go.Heatmap(z=K_short, colorscale="Viridis", showscale=False), row=1, col=1)
fig.add_trace(go.Heatmap(z=K_long, colorscale="Viridis", showscale=True), row=1, col=2)
fig.update_layout(title="Kernel matrices: longer length-scale couples points more strongly")
fig.show()
3) GP regression via Gaussian conditioning#
The core math trick in GP regression is conditioning a joint Gaussian.
Assume a zero-mean prior (for simplicity):
\[
f \sim \mathcal{N}\big(0,\, K(X,X)\big)
\]
And a Gaussian noise model:
\[
y = f + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)
\]
So:
\[
y \sim \mathcal{N}\big(0,\, K(X,X) + \sigma^2 I\big)
\]
For test points \(X_*\), the joint distribution of training outputs \(y\) and latent test function values \(f_*\) is Gaussian:
\[
\begin{bmatrix} y \\ f_* \end{bmatrix}
\sim \mathcal{N}\!\left(0,\,
\begin{bmatrix}
K(X,X) + \sigma^2 I & K(X,X_*) \\
K(X_*,X) & K(X_*,X_*)
\end{bmatrix}\right)
\]
Conditioning a Gaussian gives the posterior:
\[
f_* \mid X, y, X_* \sim \mathcal{N}\big(\bar f_*,\, \operatorname{cov}(f_*)\big),
\qquad
\bar f_* = K(X_*,X)\,[K(X,X)+\sigma^2 I]^{-1} y,
\]
\[
\operatorname{cov}(f_*) = K(X_*,X_*) - K(X_*,X)\,[K(X,X)+\sigma^2 I]^{-1} K(X,X_*)
\]
Interpretation:
The posterior mean is a kernel-weighted combination of observed outputs.
The posterior variance shrinks near observed points and grows far away.
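Before implementing the full GP version, it helps to sanity-check the conditioning formula itself. Here is a minimal sketch on two jointly Gaussian variables (the covariance and observed value are arbitrary choices for illustration):
# Check the Gaussian conditioning formula on a 2-variable example.
# For [a, b]^T ~ N(0, Sigma), conditioning on b gives
#   a | b ~ N(Sigma_ab / Sigma_bb * b, Sigma_aa - Sigma_ab^2 / Sigma_bb).
Sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])
b_obs = 1.5

mu_cond = Sigma[0, 1] / Sigma[1, 1] * b_obs
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(f"analytic:  mean={mu_cond:.3f}, var={var_cond:.3f}")

# Monte Carlo check: keep joint samples whose b-component is close to b_obs.
joint = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
a_given_b = joint[np.abs(joint[:, 1] - b_obs) < 0.05, 0]
print(f"empirical: mean={a_given_b.mean():.3f}, var={a_given_b.var():.3f}")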
def gp_posterior_regression(
X_train: np.ndarray,
y_train: np.ndarray,
X_test: np.ndarray,
kernel_fn,
noise_variance: float = 1e-4,
jitter: float = 1e-9,
) -> tuple[np.ndarray, np.ndarray]:
X_train = _as_2d(X_train)
X_test = _as_2d(X_test)
y_train = np.asarray(y_train, dtype=float).ravel()
n = X_train.shape[0]
K = kernel_fn(X_train, X_train) + (noise_variance + jitter) * np.eye(n)
L = np.linalg.cholesky(K)
# alpha = K^{-1} y via two triangular solves
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
K_s = kernel_fn(X_train, X_test) # n x m
mu = K_s.T @ alpha # m
v = np.linalg.solve(L, K_s) # n x m
K_ss = kernel_fn(X_test, X_test)
cov = K_ss - v.T @ v
return mu, cov
def log_marginal_likelihood_zero_mean(
X_train: np.ndarray,
y_train: np.ndarray,
kernel_fn,
noise_variance: float,
jitter: float = 1e-9,
) -> float:
X_train = _as_2d(X_train)
y_train = np.asarray(y_train, dtype=float).ravel()
n = X_train.shape[0]
K = kernel_fn(X_train, X_train) + (noise_variance + jitter) * np.eye(n)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
log_det = 2.0 * float(np.sum(np.log(np.diag(L))))
quad = float(y_train @ alpha)
return -0.5 * quad - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)
# Synthetic regression: noisy samples from a smooth function
def f_true(x: np.ndarray) -> np.ndarray:
return np.sin(0.8 * x) + 0.25 * np.cos(2.0 * x)
n_train = 18
X_train = np.sort(rng.uniform(-4.5, 4.5, size=n_train))
sigma = 0.15
y_train = f_true(X_train) + sigma * rng.normal(size=n_train)
X_test = np.linspace(-6, 6, 400)
k = lambda X, Z: rbf_kernel(X, Z, length_scale=1.2, variance=1.0)
mu, cov = gp_posterior_regression(X_train, y_train, X_test, k, noise_variance=sigma**2)
std = np.sqrt(np.clip(np.diag(cov), 0.0, np.inf))
posterior_samples = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(X_test)), size=5)
fig = plot_gp_samples_1d(
X_test,
posterior_samples,
title="GP regression posterior (RBF kernel): mean + uncertainty + samples",
y_mean=mu,
y_std=std,
x_train=X_train,
y_train=y_train,
)
fig.add_trace(go.Scatter(x=X_test, y=f_true(X_test), mode="lines", line=dict(dash="dot"), name="true f(x)", opacity=0.7))
fig.show()
# Interpolation vs smoothing: change the assumed noise
noise_levels = [1e-6, sigma**2, 0.5**2]
titles = ["almost noiseless (interpolates)", "matched noise", "high noise (smoother)"]
fig = make_subplots(rows=1, cols=3, subplot_titles=titles)
for col, nv in enumerate(noise_levels, start=1):
mu_c, cov_c = gp_posterior_regression(X_train, y_train, X_test, k, noise_variance=nv)
std_c = np.sqrt(np.clip(np.diag(cov_c), 0.0, np.inf))
fig.add_trace(go.Scatter(x=X_test, y=mu_c + 1.96 * std_c, mode="lines", line=dict(width=0), showlegend=False), row=1, col=col)
fig.add_trace(
go.Scatter(
x=X_test,
y=mu_c - 1.96 * std_c,
mode="lines",
line=dict(width=0),
fill="tonexty",
fillcolor="rgba(31, 119, 180, 0.18)",
showlegend=False,
),
row=1,
col=col,
)
fig.add_trace(go.Scatter(x=X_test, y=mu_c, mode="lines", line=dict(width=3), showlegend=False), row=1, col=col)
fig.add_trace(go.Scatter(x=X_train, y=y_train, mode="markers", marker=dict(size=6, color="black"), showlegend=False), row=1, col=col)
fig.update_layout(title="Noise controls interpolation vs smoothing")
fig.show()
4) GP regression from scratch: what the equations are doing#
The posterior mean can be rewritten as:
\[
\mu(x_*) = k(x_*, X)\,[K+\sigma^2 I]^{-1} y = \sum_{i=1}^{n} \alpha_i\, k(x_*, x_i),
\qquad \alpha = [K+\sigma^2 I]^{-1} y
\]
Think of it like this:
\(k(x_*,X)\) measures similarity between \(x_*\) and each training point.
\([K+\sigma^2 I]^{-1}y\) produces weights that account for redundancy between training points.
So GP regression isn’t just “weighted averaging” — it’s weighted averaging corrected for correlations.
Also notice the computation bottleneck: factoring/inverting the \(n\times n\) matrix \(K+\sigma^2 I\).
That’s why we used a Cholesky factorization above.
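As a quick numerical check (reusing X_train, y_train, the kernel k, and sigma from the cells above), we can compute the weights \(\alpha\) once and reproduce the posterior mean at a query point as a weighted sum of kernel similarities:
# Sketch: posterior mean as a kernel-weighted sum, mu(x*) = sum_i alpha_i k(x*, x_i).
n = len(X_train)
K_noisy = k(X_train, X_train) + sigma**2 * np.eye(n)
L_chol = np.linalg.cholesky(K_noisy)
alpha = np.linalg.solve(L_chol.T, np.linalg.solve(L_chol, y_train))

x_star = np.array([0.5])                  # an arbitrary query point
k_star = k(X_train, x_star).ravel()       # similarity of x* to each training point
mu_weighted = float(k_star @ alpha)

mu_ref, _ = gp_posterior_regression(X_train, y_train, x_star, k, noise_variance=sigma**2)
print(mu_weighted, float(mu_ref[0]))      # the two should agree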
5) Hyperparameters + log marginal likelihood (Occam’s razor)#
Kernels have hyperparameters (length-scale, amplitude, periodicity, …). In GP regression we can fit them by maximizing the log marginal likelihood:
\[
\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^T K_y^{-1} y - \tfrac{1}{2}\log|K_y| - \tfrac{n}{2}\log 2\pi,
\qquad K_y = K(X,X) + \sigma^2 I
\]
Interpretation:
The quadratic term (\(y^T K_y^{-1}y\)) rewards data fit.
The log-determinant term (\(\log|K_y|\)) penalizes overly flexible models (it’s a built-in complexity penalty).
That balance is why people say GPs implement an “Occam’s razor” automatically.
# Visualize the log marginal likelihood over a small grid of (length_scale, noise)
length_scales = np.linspace(0.2, 2.8, 80)
noises = np.linspace(0.02, 0.6, 80) # noise std
LML = np.zeros((len(noises), len(length_scales)))
for i, ns in enumerate(noises):
for j, ls in enumerate(length_scales):
k_ij = lambda X, Z, ls=ls: rbf_kernel(X, Z, length_scale=ls, variance=1.0)
LML[i, j] = log_marginal_likelihood_zero_mean(X_train, y_train, k_ij, noise_variance=ns**2)
best = np.unravel_index(np.argmax(LML), LML.shape)
best_noise, best_ls = noises[best[0]], length_scales[best[1]]
print("best length_scale ~", float(best_ls), "best noise std ~", float(best_noise))
fig = px.imshow(
LML,
x=length_scales,
y=noises,
origin="lower",
aspect="auto",
color_continuous_scale="Viridis",
title="Log marginal likelihood surface (RBF): fit vs complexity tradeoff",
)
fig.update_layout(xaxis_title="length_scale", yaxis_title="noise std")
fig.add_trace(
go.Scatter(
x=[best_ls],
y=[best_noise],
mode="markers",
marker=dict(size=12, color="red", symbol="x"),
name="max",
)
)
fig.show()
best length_scale ~ 1.351898734177215 best noise std ~ 0.16683544303797468
# A quick "adaptive fitting" demo: pick the next point where uncertainty is largest.
def gp_std_on_grid(X_train: np.ndarray, y_train: np.ndarray, X_test: np.ndarray, kernel_fn, noise_variance: float) -> np.ndarray:
mu_, cov_ = gp_posterior_regression(X_train, y_train, X_test, kernel_fn, noise_variance=noise_variance)
return np.sqrt(np.clip(np.diag(cov_), 0.0, np.inf))
X_train_a = X_train.copy()
y_train_a = y_train.copy()
std_before = gp_std_on_grid(X_train_a, y_train_a, X_test, k, noise_variance=sigma**2)
x_new = float(X_test[np.argmax(std_before)])
y_new = float(f_true(np.array([x_new]))[0] + sigma * rng.normal())
# append the new observation, then sort inputs and targets consistently
order = np.argsort(np.r_[X_train_a, x_new])
X_train_a = np.r_[X_train_a, x_new][order]
y_train_a = np.r_[y_train_a, y_new][order]
std_after = gp_std_on_grid(X_train_a, y_train_a, X_test, k, noise_variance=sigma**2)
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, subplot_titles=["Uncertainty before adding a new point", "Uncertainty after adding it"])
fig.add_trace(go.Scatter(x=X_test, y=std_before, mode="lines", name="std"), row=1, col=1)
fig.add_trace(go.Scatter(x=[x_new], y=[std_before.max()], mode="markers", marker=dict(size=10, color="red"), name="chosen x"), row=1, col=1)
fig.add_trace(go.Scatter(x=X_test, y=std_after, mode="lines", showlegend=False), row=2, col=1)
fig.add_trace(go.Scatter(x=[x_new], y=[std_after[np.argmax(std_before)]], mode="markers", marker=dict(size=10, color="red"), showlegend=False), row=2, col=1)
fig.update_layout(title="Adaptive sampling intuition: query where the GP is most uncertain")
fig.update_xaxes(title_text="x", row=2, col=1)
fig.update_yaxes(title_text="posterior std", row=1, col=1)
fig.update_yaxes(title_text="posterior std", row=2, col=1)
fig.show()
print("picked x_new=", x_new, "| y_new=", y_new)
picked x_new= 6.0 | y_new= -0.6056786127882988
6) scikit-learn GP regression + parameter map#
scikit-learn provides GaussianProcessRegressor (GPR). It optimizes kernel hyperparameters by maximizing the log marginal likelihood.
Key parameters:
kernel: the covariance function. If you include bounds, sklearn will optimize those parameters.
alpha: added to the diagonal during fitting; can represent observation noise variance (or just numerical jitter).
normalize_y: center/scale the targets internally (often helpful).
n_restarts_optimizer: tries multiple initializations (helps avoid bad local optima).
random_state: reproducibility for the optimizer restarts.
Common kernel pattern:
ConstantKernel sets the overall amplitude
RBF (or Matern) sets smoothness/length-scale
WhiteKernel explicitly models noise
Xtr = X_train.reshape(-1, 1)
Xte = X_test.reshape(-1, 1)
kernel = ConstantKernel(1.0, (1e-2, 1e2)) * RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e2)) + WhiteKernel(
noise_level=sigma**2,
noise_level_bounds=(1e-6, 1e0),
)
gpr = GaussianProcessRegressor(
kernel=kernel,
alpha=0.0,
normalize_y=True,
n_restarts_optimizer=5,
random_state=42,
)
gpr.fit(Xtr, y_train)
print("Learned kernel:")
print(gpr.kernel_)
print("log marginal likelihood:", float(gpr.log_marginal_likelihood(gpr.kernel_.theta)))
mu_sk, std_sk = gpr.predict(Xte, return_std=True)
fig = go.Figure()
fig.add_trace(go.Scatter(x=X_test, y=mu_sk + 1.96 * std_sk, mode="lines", line=dict(width=0), showlegend=False))
fig.add_trace(
go.Scatter(
x=X_test,
y=mu_sk - 1.96 * std_sk,
mode="lines",
line=dict(width=0),
fill="tonexty",
fillcolor="rgba(0, 150, 136, 0.18)",
name="95% interval",
)
)
fig.add_trace(go.Scatter(x=X_test, y=mu_sk, mode="lines", name="mean", line=dict(width=3, color="rgb(0,150,136)")))
fig.add_trace(go.Scatter(x=X_train, y=y_train, mode="markers", marker=dict(size=8, color="black"), name="observations"))
fig.add_trace(go.Scatter(x=X_test, y=f_true(X_test), mode="lines", line=dict(dash="dot"), name="true f(x)", opacity=0.7))
fig.update_layout(title="GaussianProcessRegressor (sklearn): posterior mean + 95% interval", xaxis_title="x", yaxis_title="y")
fig.show()
Learned kernel:
1.23**2 * RBF(length_scale=1.29) + WhiteKernel(noise_level=0.055)
log marginal likelihood: -12.687509884300587
7) GP classification: latent GP + logistic link + Laplace (intuition)#
For classification we want probabilities:
\[
p(y=1 \mid x) \in [0, 1]
\]
A common GP classification model introduces a latent function \(f(x)\) with a GP prior, then passes it through a link function (logistic):
\[
p(y=1 \mid x) = \sigma\big(f(x)\big) = \frac{1}{1 + e^{-f(x)}}
\]
The big difference from regression:
The likelihood is not Gaussian anymore.
So the posterior \(p(f\mid y)\) is not exactly Gaussian.
A standard approach is the Laplace approximation:
Find the mode \(\hat{f}\) of the log posterior.
Approximate the posterior with a Gaussian around that mode.
scikit-learn’s GaussianProcessClassifier implements this (logistic link + Laplace).
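To make step 1 concrete, here is a minimal sketch of the Newton mode-finding iteration. It follows the structure of Algorithm 3.1 in Rasmussen & Williams (2006); the laplace_mode helper is a didactic illustration, not scikit-learn's actual code:
# Sketch: Newton iterations for the Laplace mode f_hat with a logistic likelihood.
# Labels y_pm live in {-1, +1}; K is the kernel matrix of the training inputs.
def laplace_mode(K: np.ndarray, y_pm: np.ndarray, n_iter: int = 20) -> np.ndarray:
    n = len(y_pm)
    f = np.zeros(n)
    t = (y_pm + 1) / 2.0                    # targets mapped to {0, 1}
    for _ in range(n_iter):
        pi = expit(f)                       # p(y=+1 | f)
        grad = t - pi                       # gradient of log p(y | f)
        W = pi * (1.0 - pi)                 # diagonal of the negative Hessian
        sqrt_W = np.sqrt(W)
        B = np.eye(n) + sqrt_W[:, None] * K * sqrt_W[None, :]
        L = np.linalg.cholesky(B)
        b = W * f + grad
        a = b - sqrt_W * np.linalg.solve(L.T, np.linalg.solve(L, sqrt_W * (K @ b)))
        f = K @ a                           # numerically stable Newton update
    return f

# Tiny 1D demo: labels flip sign at x = 0.
x_demo = np.linspace(-3, 3, 40)
y_demo = np.where(x_demo > 0, 1.0, -1.0)
K_demo = rbf_kernel(x_demo, x_demo, length_scale=1.0, variance=4.0) + 1e-8 * np.eye(40)
f_hat = laplace_mode(K_demo, y_demo)
print("p(y=1) ranges from", round(float(expit(f_hat).min()), 3), "to", round(float(expit(f_hat).max()), 3))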
Below we focus on building intuition via probability maps and uncertainty regions.
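First, a prior intuition builder using the from-scratch helpers from earlier: sampling latent functions from a GP prior and squashing them through the logistic link shows that each latent sample becomes a random probability function (the length-scale and amplitude below are arbitrary choices):
# Sketch: latent GP samples squashed through the logistic link.
x_lat = np.linspace(-4, 4, 200)
f_lat = sample_gp_prior_1d(x_lat, lambda A, B: rbf_kernel(A, B, length_scale=1.0, variance=4.0), n_samples=4)
fig = make_subplots(rows=1, cols=2, subplot_titles=["latent f(x) ~ GP prior", "p(y=1 | x) = expit(f(x))"])
for s in f_lat:
    fig.add_trace(go.Scatter(x=x_lat, y=s, mode="lines", line=dict(width=1), showlegend=False), row=1, col=1)
    fig.add_trace(go.Scatter(x=x_lat, y=expit(s), mode="lines", line=dict(width=1), showlegend=False), row=1, col=2)
fig.update_layout(title="A latent GP + logistic link defines random probability functions")
fig.show()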
# 2D probabilistic classification demo with sklearn
X, y = make_moons(n_samples=220, noise=0.25, random_state=42)
# GPs are sensitive to feature scales, so standardize.
kernel_rbf = 1.0 * RBF(length_scale=1.0)
gpc_rbf = make_pipeline(
StandardScaler(),
GaussianProcessClassifier(
kernel=kernel_rbf,
n_restarts_optimizer=2,
max_iter_predict=100,
random_state=42,
),
)
gpc_rbf.fit(X, y)
# build a grid for visualization
pad = 0.6
x_min, x_max = float(X[:, 0].min() - pad), float(X[:, 0].max() + pad)
y_min, y_max = float(X[:, 1].min() - pad), float(X[:, 1].max() + pad)
xs = np.linspace(x_min, x_max, 260)
ys = np.linspace(y_min, y_max, 260)
xx, yy = np.meshgrid(xs, ys)
grid = np.c_[xx.ravel(), yy.ravel()]
proba = gpc_rbf.predict_proba(grid)[:, 1].reshape(xx.shape)
fig = go.Figure()
fig.add_trace(
go.Contour(
x=xs,
y=ys,
z=proba,
colorscale="Viridis",
contours=dict(coloring="fill"),
colorbar=dict(title="P(y=1)"),
)
)
# decision boundary at 0.5
fig.add_trace(
go.Contour(
x=xs,
y=ys,
z=proba,
showscale=False,
contours=dict(coloring="none", start=0.5, end=0.5, size=1),
line=dict(color="black", width=2),
hoverinfo="skip",
)
)
colors = np.where(y == 1, "#1f77b4", "#d62728")
fig.add_trace(
go.Scatter(
x=X[:, 0],
y=X[:, 1],
mode="markers",
marker=dict(size=7, color=colors, line=dict(color="white", width=0.5)),
name="data",
)
)
fig.update_layout(
title="GaussianProcessClassifier (RBF): probability map + decision boundary",
xaxis_title="x1",
yaxis_title="x2",
)
fig.show()
print("Learned kernel (classifier):")
print(gpc_rbf.named_steps["gaussianprocessclassifier"].kernel_)
Learned kernel (classifier):
6.77**2 * RBF(length_scale=0.949)
# Compare kernels for classification: RBF vs Matern (different smoothness assumptions)
gpc_matern = make_pipeline(
StandardScaler(),
GaussianProcessClassifier(
kernel=1.0 * Matern(length_scale=1.0, nu=1.5),
n_restarts_optimizer=2,
max_iter_predict=100,
random_state=42,
),
)
gpc_matern.fit(X, y)
proba_rbf = gpc_rbf.predict_proba(grid)[:, 1].reshape(xx.shape)
proba_mat = gpc_matern.predict_proba(grid)[:, 1].reshape(xx.shape)
fig = make_subplots(rows=1, cols=2, subplot_titles=["RBF kernel", "Matern (nu=1.5)"])
for col, P in enumerate([proba_rbf, proba_mat], start=1):
fig.add_trace(
go.Contour(
x=xs,
y=ys,
z=P,
colorscale="Viridis",
contours=dict(coloring="fill"),
showscale=(col == 2),
colorbar=dict(title="P(y=1)"),
),
row=1,
col=col,
)
fig.add_trace(
go.Contour(
x=xs,
y=ys,
z=P,
showscale=False,
contours=dict(coloring="none", start=0.5, end=0.5, size=1),
line=dict(color="black", width=2),
hoverinfo="skip",
),
row=1,
col=col,
)
fig.add_trace(
go.Scatter(
x=X[:, 0],
y=X[:, 1],
mode="markers",
marker=dict(size=6, color=colors, line=dict(color="white", width=0.5)),
showlegend=False,
),
row=1,
col=col,
)
fig.update_layout(title="Kernel choice changes the function prior → changes the classifier")
fig.show()
print("RBF kernel learned:")
print(gpc_rbf.named_steps["gaussianprocessclassifier"].kernel_)
print("Matern kernel learned:")
print(gpc_matern.named_steps["gaussianprocessclassifier"].kernel_)
RBF kernel learned:
6.77**2 * RBF(length_scale=0.949)
Matern kernel learned:
8.1**2 * Matern(length_scale=1.91, nu=1.5)
8) Practical tips, limitations, and scaling strategies#
Practical tips#
Scale inputs: kernels are distance-based; feature scaling changes everything.
Start with RBF (very smooth) or Matern (rougher) and let the optimizer tune length-scales.
For regression, include a WhiteKernel (explicit noise) or use alpha.
Use n_restarts_optimizer if hyperparameters are sensitive.
Complexity (why GPs don’t scale by default)#
Training typically needs a Cholesky factorization of an \(n\times n\) matrix: \(\mathcal{O}(n^3)\) time (see the timing sketch below).
Storing the kernel matrix costs \(\mathcal{O}(n^2)\) memory.
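A rough, machine-dependent sketch of that cubic growth (absolute times will vary with hardware and BLAS; expect roughly an 8x increase per doubling of n):
# Sketch: time the Cholesky factorization as n grows.
import time

for n in [250, 500, 1000, 2000]:
    A = rbf_kernel(np.linspace(0, 10, n), np.linspace(0, 10, n)) + 1e-6 * np.eye(n)
    t0 = time.perf_counter()
    np.linalg.cholesky(A)
    print(f"n={n:5d}  cholesky: {time.perf_counter() - t0:.4f} s")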
High-dimensional inputs#
In high dimensions, distances become less informative and kernels can behave poorly.
Common fixes:
dimensionality reduction / feature engineering
ARD kernels (separate length-scale per feature)
sparse/approximate GP methods (inducing points, random Fourier features; see the sketch below)
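As a taste of the last fix, here is a minimal random Fourier features sketch (after Rahimi & Recht, 2007). The rff_features helper is our illustrative construction, and the approximation error shrinks as n_features grows:
# Sketch: approximate the RBF kernel with random Fourier features, so kernel
# evaluations become dot products of explicit finite-dimensional features.
def rff_features(X: np.ndarray, n_features: int = 500, length_scale: float = 1.2, seed: int = 0) -> np.ndarray:
    X = _as_2d(X)
    r = np.random.default_rng(seed)
    W = r.normal(scale=1.0 / length_scale, size=(X.shape[1], n_features))  # spectral frequencies
    b = r.uniform(0.0, 2.0 * np.pi, size=n_features)                      # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

Phi = rff_features(X_train, n_features=2000)
K_approx = Phi @ Phi.T
K_exact = rbf_kernel(X_train, X_train, length_scale=1.2, variance=1.0)
print("max |K_approx - K_exact|:", float(np.max(np.abs(K_approx - K_exact))))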
Exercises#
Build a GP regression model with a periodic kernel and fit data from a periodic function. How does the posterior behave outside the observed range?
For regression, compare RBF vs Matern (nu=0.5, 1.5, 2.5). Which priors look most realistic for noisy data?
On the moons dataset, sweep over length_scale (fix it instead of optimizing it) and visualize under- and over-fitting.
Explain (in words) why the log marginal likelihood includes a \(\log|K_y|\) term and how it relates to model complexity.
References#
Rasmussen, Williams (2006): Gaussian Processes for Machine Learning (the classic)
scikit-learn docs: GaussianProcessRegressor, GaussianProcessClassifier, sklearn.gaussian_process.kernels